library(tidyverse)
library(maptools)
library(sp)
library(rgdal)
library(choroplethrMaps)
library(choroplethr)
library(ggridges)
library(forcats)
library(GGally)
library(gridExtra)
library(extracat)
library(plotly)

I. Introduction

II. Description of the data source

We used Community Health Status Indicators (CHSI) dataset, which was published by the “Centers for Disease Control and Prevention” on website https://catalog.data.gov/dataset/community-health-status-indicators-chsi-to-combat-obesity-heart-disease-and-cancer. It was created on June 19, 2015, and updated on February 26, 2019. There were 9 tables and contained over 200 measures for each of the 3,141 United States counties. We mainly used four tables in this file.

Tables Description

Table1 (Demographics): “demographics.csv”

This table provided overall characteristics of each county. The data were obtained from the Current Population Survey (CPS) conducted by the U.S. Bureau of the Census. The CPS is an ongoing survey of states from which estimates for counties are derived.

Table2 (Health Record): “summary measures of health.csv”

This table contained broad measures of health. Each measure captures a single, comprehensive measure of population-based health. Those measures were generated from different sources. Average Life Expectancy was calculated by Chris Murray and colleagues at the Harvard School of Public Health. All Causes of Death came from the National Vital Statistics System, National Center for Health Statistics. Health Status and Average Number of Unhealthy Days are provided by the Behavioral Risk Factor Surveillance System (BRFSS), a survey conducted jointly by states and the Centers for Disease Control and Prevention.

Table3 (Risk Factors): “risk factors and access to care.csv”

This table provides information on the prevalence of adult risk characteristics associated with the leading causes of death. It was conducted by The Behavioral Risk Factor Surveillance System (BRFSS).

Table4 (Diseases): “measures of birth and death.csv”

This table contained birth rate and death rate due to certain reasons. We mainly used the mortality data, and they came from the National Center for Health Statistics, National Vital Statistics System.

Variables We Used:

Table1 (Demographics): Population Size(Population_Size); Poverty level(Poverty).
Table2 (Health Record): Average Life Expectancy(ALE); All Causes of Death(All_Death); Self-rated Health Status(Health_Status); Average Number of Unhealthy Days in Past Month(Unhealthy_Days).
Table3 (Risk Factors): No exercise(No_Exercise); Few fruits/vegetables(Few_Fruit_Veg); Obesity(Obesity); High Blood Pressure(High_Blood_Pres); Smoker(Smoker); Diabetes(Diabetes).
Table4 (Diseases): Breast Cancer (Female)(Brst_Cancer); Colon Cancer(Col_Cancer); Coronary Heart Disease(CHD); Lung Cancer(Lung_Cancer); Stroke(Stroke).

Known Issues

1. Some data were estimated using a small sample size, so the data were not really accurate. The confidence interval for these measures may be wide.
2. Due to huge numbers of counties and large numbers of variables, there were many missing values (no data available / no report).

III. Description of data import / cleaning / transformation

## Data
Demo<-read.csv("demographics.csv")
Health.Summary<-read.csv("summary measures of health.csv")
Risk<-read.csv("risk factors and access to care.csv")
Birth.Death<-read.csv("measures of birth and death.csv")


Health<-cbind(Demo[,c(1:6,9,12,15,30,33,36,39,42)], Health.Summary[,c(7,11,17,23)],Risk[,c(7,10,13,16,19,22)],Birth.Death[,c(85,91,97,109,121)])

## Identify NA
for (i in 7:29){
  Health[,i]<-ifelse(Health[,i]<0,NA,Health[,i])
}


## Health by State
Health.State<-Health %>% 
  group_by(CHSI_State_Name) %>% 
  summarize(State_FIPS_Code=mean(State_FIPS_Code),
            Population.Size=sum(Population_Size),
            Poverty=sum(Population_Size*0.01*Poverty, na.rm=TRUE)/sum(Population_Size,na.rm = TRUE)*100, 
            Life_exp=mean(ALE, na.rm=TRUE),
            All_Death=sum(All_Death, na.rm = TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Health_Status=sum(Population_Size*0.01*Health_Status,na.rm=TRUE)/sum(Population_Size)*100,
            Unhealthy_Days=mean(Unhealthy_Days,na.rm=TRUE),
            No_Exercise=sum(Population_Size*0.01*No_Exercise, na.rm=TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Few_Fruit_Veg=sum(Population_Size*0.01*Few_Fruit_Veg, na.rm=TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Obese=sum(Population_Size*0.01*Obesity, na.rm=TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            High_Blood_Pres=sum(Population_Size*0.01*High_Blood_Pres, na.rm=TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Smoker=sum(Population_Size*0.01*Smoker, na.rm=TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Diabetes=sum(Population_Size*0.01*Diabetes, na.rm=TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Brst_Cancer=sum(Brst_Cancer,na.rm = TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Col_Cancer=sum(Col_Cancer,na.rm = TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            CHD=sum(CHD,na.rm = TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Lung_Cancer=sum(Lung_Cancer,na.rm = TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Stroke=sum(Stroke,na.rm = TRUE)/sum(Population_Size, na.rm = TRUE)*100)
Health.State[,c(4:19)]<-round(Health.State[,c(4:19)],2)
Health.State$CHSI_State_Abbr<-names(table(Health$CHSI_State_Abbr))
Health.State[Health.State$CHSI_State_Name=="Alaska",]<-ifelse(Health.State[Health.State$CHSI_State_Name=="Alaska",]==0, NA, Health.State[Health.State$CHSI_State_Name=="Alaska",])
write.csv(Health.State,"Health.State.csv")
Using “Health.csv”, we developed another table “Health.State.csv,” which contains the information for each state. Within each state, we calculated the average rate for each variable. To calculate the average rate, we first used the population size of each county times the rate (of each county) to get the number of people. We grouped the data by state and summed up the number of people. Then using the number of people (of each state) divided by the state’s total population to get the state average rate. We did this for most variables. In this table, only Alaska had several missing values.

IV. Analysis of missing values

a. Missing Values from ‘Health’ Dataset

visna(Health, sort = "b")

In “Health.csv”, there are many missing values, since there are 3141 counties and it is hard to collect all the data. When we look vertically, among those variables, table3–risk factors is the one that had the highest number of missing values. In particular, around 50% of counties did not have High Blood Pressure(High_Blood_Pres) record. If we look horizontally, we can tell that most counties had complete records.

b. Missing Values from ‘Health.State’ Dataset

visna(Health.State, sort = "b")

In “Health.State.csv”, only one state (Alaska) had missing values. It missed eight variables, and most of them are risk factors.
In the maps, we displayed the missing values using black. In other graphs, we ignored those missing values.

V. Results

I. Relationship Between all Variables

a. Correlation Plot of All Variables

## correlation grid
ggcorr(Health.State[,-c(1:2,20)], label = TRUE, label_round = 2, hjust = 0.75, size = 3, label_size = 2, title="Correlation Plot")+labs(title = "Correlation Plot")

This plot shows the correlation between each pair of variables. Red represents a positive correlation and blue means a negative correlation. The lighter color means a weaker relationship. From the above grid, here are some of our findings.
1. The poverty rate has a strong correlation with life expectancy, Health status, and some risk factors (No exercise, Obesity, Diabetes). In particular, a higher rate of poverty may imply a shorter life expectancy, a higher rate of poor health status, more unhealthy days in the past month, and a higher rate of risk factors.
2. Health status is highly correlated with risk factors. The states with a high rate of risk factors tend to have higher unhealthy rates.
4. To our suprise, life expectancy has high correlation with risk factors, but not diseases.
5. High blood pressure has a negative correlation with the listed diseases (Breast Cancer (Female), Colon Cancer, Coronary Heart Disease, Lung Cancer, Stroke). Few fruits/vegetables, Obese and Smoker have moderate correlation with Diseases.
6. Most risk factors are correlated with each other. One exception is that Few fruits/vegetables has less correlation Diabetes.
7. Diseases are highly correlated. It means the state with high rate of one disease is likely to have high rate of other diseases also.

b. Scatter Plot of Risk Factors and Diseases Death

tidydf<-gather(Health.State, key = "Diseases", value = "Percentage", colnames(Health.State%>%select(15:19)))

## Scatter plot (no exercise)
sc1<-ggplot(data = tidydf, mapping = aes(x = No_Exercise, y = Percentage)) +
  geom_point(aes(color = Diseases),size=0.7,alpha=0.7)+ylab("Diseases Percentage")

## Scatter plot (few fruit and veg)
sc2<-ggplot(data = tidydf, mapping = aes(x = Few_Fruit_Veg, y = Percentage)) +
  geom_point(aes(color = Diseases),size=0.7,alpha=0.7)+ylab("Diseases Percentage")

## Scatter plot (Obesity)
sc3<-ggplot(data = tidydf, mapping = aes(x = Obese, y = Percentage)) +
  geom_point(aes(color = Diseases),size=0.7,alpha=0.7)+ylab("Diseases Percentage")

## Scatter plot (High blood pre)
sc4<-ggplot(data = tidydf, mapping = aes(x = High_Blood_Pres, y = Percentage)) +
  geom_point(aes(color = Diseases),size=0.7,alpha=0.7)+ylab("Diseases Percentage")

## Scatter plot (Smoker)
sc5<-ggplot(data = tidydf, mapping = aes(x = Smoker, y = Percentage)) +
  geom_point(aes(color = Diseases),size=0.7,alpha=0.7)+ylab("Diseases Percentage")

## Scatter plot (Diabetes)
sc6<-ggplot(data = tidydf, mapping = aes(x =Diabetes, y = Percentage)) +
  geom_point(aes(color = Diseases),size=0.7,alpha=0.7)+ylab("Diseases Percentage")

grid.arrange(sc1, sc2,sc3,sc4,sc5,sc6, nrow = 3)

From this set of plot, we can tell that there is no obvious relationship between risk factors and diseases. But we can still find some tiny influence of risk factors on diseases. For example, when the percentage of few fruits and vegetables and the percentage of smoker go up, there are more percentage of diseases leading to death.

II. Risk Factors Analysis of the United States

a.Distribution of Risk Factor

### boxplot
New<-gather(Health, key="Categories", value = "Percentage", colnames(Health%>%select(19:24)))
New<-na.omit(New[,c(24,25)])
ggplot(New) + geom_boxplot(aes(y=Percentage,x=reorder(Categories, Percentage, median)))+
  ylab("Percentage")+xlab("Risk Factors")+ggtitle("Boxplot of Risk Factors")+
  coord_flip()

## ridge
ggplot(New)+
  geom_density_ridges(aes(x=New$Percentage, y=reorder(New$Categories, New$Percentage, median)))+xlab("Percentage")+ylab("Risk Factors")+ggtitle("Ridgeline Plot of Risk Factors")

Both boxplot and ridgeline plot show the distribution of risk factors, taking counties as observations.
1. Few fruits/vegetables is a serious problem in the United States. The overall rate is higher compare to other risk factors. Most counties have high Few fruits/vegetables rate.
2. Diabetes rate is lower compare to other risk factors. In addition, the distribution has a higher peak and small spread, meaning that the rate of diabetes among all counties are pretty similar.

b.Relationship between Risk Factors and all States (PCA)

## PCA biplot
HS.pc<- as.data.frame(Health.State[,c(9:14)])
rownames(HS.pc) <- Health.State$CHSI_State_Name
pr.out<-prcomp(na.omit(HS.pc),scale=TRUE)
biplot(pr.out, expand=1.3, cex=c(0.6,0.6), main="Biplot of Risk Factors (Scaled)", xlim=c(-0.4, 0.4),
 ylim=c(-0.5,0.5), col=c("orange","blue"))

The biplot is an enhanced scatterplot that uses both points and vectors to represent structure. Samples are displayed as points while variables are displayed as vectors. It applies principal components analysis, which helps us display high dimensional data with lower dimensions (in this case, two dimensions), without losing key characteristics. The axes of a biplot are a pair of principal components. If two points are close to each other, meaning they are alike in risk factors. The arrows represent the variables. Along the direction means a higher rate.
1. Most arrows point toward the same direction, meaning that those variables are highly correlated, besides the factor Few fruits/vegetables.
2. The states on the opposite direction of the arrow, like Colorado and Minnesota, have lower rate of risk factors.
3. West Virginia has high risk factor rates since it locates in the same direction with those arrows.

III. Relationship Between Risk Factors and Health Information

– For Example: no exercise percentage and Average Life Expectancy (ALE)

a.Map of no exercise and Map of ALE

#average life expectancy ALE
countycode<-substr(paste("00",Health[,2],sep=''),nchar(paste("00",Health[,2],sep=''))-2,nchar(paste("00",Health[,2],sep='')))
code<-as.numeric(paste(Health[,1],countycode,sep=''))

county.ALE<-data.frame(region=as.numeric(paste(Health[,1],countycode,sep='')),value=Health[,15])
county_choropleth(county.ALE, 
                  title  = "County Data, Average Life Expectancy (ALE)", 
                  legend = "ALE")

county.No_Exercise<-cbind.data.frame(region=code,value=Health[,19])
county_choropleth(county.No_Exercise, 
                  title  = "County Data, No_Exercise", 
                  legend = "No_Exercise")

In the map, the lighter blue represents a lower value, and darker blue represents a higher value. Black indicated missing values.
1. The first map shows a shorter life expectancy in the Southeast of the United States. Middle North and the Western United States have longer life expectancy.
2. The second map displays a lower no exercise rate in the Middle North and West, while Southeast has higher no exercise rate.
Thus, we may conclude that the rate of no exercise has some correlation with the Average life expectancy.

b. Heatmap of no exercise V.S. ALE

p1<-plot_ly(x=Health$ALE,
        y=Health$No_Exercise
          )%>%
  layout(
    title = "Heatmap no exericse VS. Average Life Expectancy",
      xaxis = list(title = "Life Expectancy"),
      yaxis = list(title = "No Exercise rate")
    )
p1 %>% add_histogram2d(nbinsx=30, nbinsy=30)
From this heatmap, the x-axis is the life expectancy, and the y-axis is the no exercise rate. The brighter color (yellow) means more observations. We can roughly see a decreasing pattern, which implies no exercise and Average Life Expectancy are negatively correlated.

c. Heatmap of Few_Fruit_Veg V.S. ALE

p2<-plot_ly(x=Health$ALE,
        y=Health$Few_Fruit_Veg
          )%>%
  layout(
    title = "Heatmap Few Fruit and Veg VS. Average Life Expectancy",
      xaxis = list(title = "Life Expectancy"),
      yaxis = list(title = "Few Fruit and Veg")
    )
p2 %>% add_histogram2d(nbinsx=30, nbinsy=30)
From this heatmap, the x-axis is the life expectancy, and the y-axis is the rate of Few_Fruit_Veg. We can observe that there doesn’t exist an obvious linear relationship between these two variables, but we can know that a relatively higher percentage of Few_Fruit_Veg will cause a shorter life expectancy. Especially, when the percentage is around 77%, life expectancy is around 77.
* More relationship between risk factors and health information can be explored by using Shiny App

IV. Disease Analysis of the United States

a.States Level Diseases Death Analysis (Cleveland Dot Plot)

## Cleveland

ggplot(tidydf, aes(Percentage, fct_reorder2(`CHSI_State_Name`, Diseases=="CHD", Percentage, .desc = FALSE), color = Diseases)) +
  geom_point() +
  ggtitle("Diseases sorted by Coronary Heart Disease Percentage") + ylab("") 

From the Cleveland Dot Plot ordered by Coronary Heart Disease (CHD), we can observe all States’ five kinds of diseases death percentage. From this ordered dot plot we find three distinctive groups of all five diseases. Breast cancer (female) and the colon cancer are in the first group in the lowest level of percentage to cause death. Lung cancer and stroke are in the second group with the medium level of percentage to cause death and we can still observe that stroke death is slightly higher than Lung cancer. Coronary Heart Disease (CHD) is the highest disease that causes death for all States and the number of CHD’s percentage is usually far higher than other causes.
So, from the dot plot we can conclude that Coronary Heart Disease is the most serious problem from the respect of diseases causing death.

b.Diseases Death Analysis based on Maps

     -- For Example: Coronary Heart Disease (CHD) County Level Map
county.CHD<-cbind.data.frame(region=code,value=Health[,27]/Health[,7]*100)
county_choropleth(county.CHD, 
                  title  = "County Data, Coronary Heart Disease (CHD)", 
                  legend = "CHD")
## Warning in super$initialize(map.df, user.df): Your data.frame contains the
## following regions which are not mappable: 2201, 2232, 2280
## Warning in self$bind(): The following regions were missing and are being
## set to NA: 31075, 31115, 2105, 49009, 48033, 48301, 2198, 15005, 31005,
## 48269, 2060, 2282, 48261, 8014, 30069, 31117, 8053, 8111, 2195, 2230, 2068,
## 2013, 2275, 16025

From the above conclusion, we know that Coronary Heart Disease is the most serious problem, so in this part, we will go further and study the distribution of Coronary Heart Disease from County level based on maps.
From the map, we can observe that Coronary Heart Disease is more serious in the middle of United States because it is obvious that in the middle of the map the color is deeper.
The southwest and east coast have a lower percentage of Coronary Heart Disease problem.
The northwest and the southeast have a higher percentage of Coronary Heart Disease problem when compared with the situation of The southwest and east coast.
* More Information about Diseases Death can be explored by using Shiny App

VI. Interactive component

In the interactive plot, we can further explore the relationship between all variables and also can get more information about each variable’s distribution of geographical location.

VII. Conclusion

The limitation of this study is that there are so many estimated values and missing values. If we can overcome those problems, we can get better and more accurate results.
In the future, we want to study the other variables in the dataset. In particular, we could add gender and races in our analysis and see what will happen. We can see whether those are influential variables.
The lesson people can take from this report is that in order to have a longer life expectancy, we need to have a healthier lifestyle. Doing more exercises, eating more fruit and vegetables, smoke less, and having less sugar are good ways. We also know that Coronary Heart Disease is the most series problem in the United States.

Link to our GitHub Repo: Yiyao Hu: https://github.com/hyy7/final-project-group7 Yakun Wang: https://github.com/yw3211/GR5293--Final-Project

Work Cited: Data Source: https://catalog.data.gov/dataset/community-health-status-indicators-chsi-to-combat-obesity-heart-disease-and-cancer Variable Explanation: https://permanent.access.gpo.gov/fdlp670/CHSI-Data_Sources_Definitions_And_Notes.pdf